[Kaggle Pandas] Python basic code

# Pandas
import pandas as pd 

# Creating
pd.DataFrame({'Yes': [50,21], 'No': [131, 2]})
df = pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
                   'Sue': ['Pretty good.', 'Bland.']},
                  index=['Product A', 'Product B'])
print(df)
print(pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A'))

#Reading
df = pd.read_csv("path/to/file.csv", index_col=0)  # placeholder path

#Choosing between loc and iloc

When choosing or transitioning between loc and iloc, there is one "gotcha" worth keeping in mind: the two methods use slightly different indexing schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively, so 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] will return 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].

Otherwise, the semantics of using loc are the same as those for iloc.
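For example, a minimal sketch on a toy DataFrame with a default 0-based numerical index (the frame below is invented for illustration) makes the difference visible:

toy = pd.DataFrame({'value': [10, 20, 30, 40, 50]})  # index 0..4
toy.iloc[0:3]  # exclusive end: rows 0, 1, 2 (3 rows)
toy.loc[0:3]   # inclusive end: rows 0, 1, 2, 3 (4 rows)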

#Manipulating the index
df.set_index("column_name")  # returns a new DataFrame indexed by the given column

#Conditional Select
df.loc[df.column_name == 'column value']
df.loc[(df.column_name1 == 'column value1') & (df.column_name2 >= 'column value2')]
df.loc[df.column_name.isin(['cand1', 'cand2'])]
df.loc[df.column_name.notnull()]

#Assigning data
df['column_name'] = 'column_value'
df['column_name'] = range(len(df), 0, -1)

#Summary functions
df.column_name.describe() #count, mean, std, quartiles, min/max, etc.
df.column_name.value_counts()

#Maps
mean = df.column_name.mean()
df.column_name.map(lambda x: x - mean)

def remean_func(row):
    # subtract the column mean from this row's value
    row.column_name = row.column_name - mean
    return row
df.apply(remean_func, axis='columns')

df.column1 + "-" + df.column2
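As a quick check, a small invented DataFrame (names below are made up for illustration) shows that map(), row-wise apply(), and the plain vectorized operators all produce the same centered column, with the vectorized form being the fastest:

toy = pd.DataFrame({'points': [80, 90, 100], 'country': ['FR', 'IT', 'US']})
m = toy.points.mean()
toy.points.map(lambda x: x - m)                         # Series of centered values
toy.apply(lambda row: row.points - m, axis='columns')   # same result via row-wise apply
toy.points - m                                          # same result, vectorized
toy.country + "-" + toy.points.astype(str)              # elementwise string concatenation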

#Grouping

df.groupby('ref_column').target_column.count() # the ref column can be the same as the target column
df.groupby('ref').apply(lambda x: x.title.iloc[0])
df.groupby(['ref1','ref2']).apply(lambda x: x.loc[x.target_column.idxmax()])

'''Another groupby() method worth mentioning is agg(), which lets you run a bunch of different functions on your DataFrame simultaneously. For example, we can generate a simple statistical summary of the dataset as follows:'''
df.groupby(['ref']).target_column.agg([len, min, max])

'''However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:'''
countries_reviewed = reviews.groupby(['country', 'province']).description.agg([len])
countries_reviewed.reset_index()
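To see what the multi-index looks like before and after, here is a self-contained toy sketch (the data below is invented; the column names just mirror the wine-reviews example above):

toy = pd.DataFrame({'country': ['Italy', 'Italy', 'France'],
                    'province': ['Tuscany', 'Tuscany', 'Alsace'],
                    'description': ['a', 'b', 'c']})
grouped = toy.groupby(['country', 'province']).description.agg([len])
# grouped is indexed by a MultiIndex (country, province);
# reset_index() turns those index levels back into ordinary columns
grouped.reset_index()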

#Sorting

countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len', ascending=False)
countries_reviewed.sort_values(by=['country', 'len'])
countries_reviewed.sort_index() # sort by index

# Data Types 

reviews.dtypes #show the dtype of every column (use reviews.points.dtype for a single column)
reviews.points.astype('float64') #convert datatype

# Missing data

df[pd.isnull(df.column_name)]
df.column_name.fillna('Unknown')
df.column_name.replace('from', 'to')

# Renaming

df.rename(columns={'from':'to'})
df.rename(index={'from':'to'})

'''Both the row index and the column index can have their own name attribute. The complementary rename_axis() method may be used to change these names. For example:'''
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

# Combining

#concat
pd.concat([canadian_youtube, british_youtube]) # default is axis=0: stacks the frames vertically (row-wise)
# passing ignore_index=True rebuilds the index of the result
result3_in = pd.concat([df1,df2],axis=1, join='inner') # column-wise (axis=1); 'inner' keeps the intersection of indexes, 'outer' the union

#merge
merge_outer = pd.merge(df1,df2, how='outer',on='id')
merge_left = pd.merge(df1,df2, how='left', left_on='stock_name', right_on='name')

#join: matches on the index (implemented on top of merge)
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK', how='inner') # how can also be 'outer', 'left', or 'right'
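Since df1/df2 and the YouTube frames above come from the Kaggle datasets, here is a self-contained toy sketch of the three combining methods (all frames below are invented for illustration):

df1 = pd.DataFrame({'id': [1, 2, 3], 'stock_name': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'name': ['B', 'C', 'D'], 'price': [10, 20, 30]})
pd.concat([df1, df2])                        # stacked vertically; missing columns become NaN
pd.concat([df1, df2], axis=1, join='inner')  # side by side, only index labels present in both
pd.merge(df1, df2, how='outer', on='id')     # SQL-style outer join on the 'id' column
df1.set_index('id').join(df2.set_index('id'), how='inner')  # join matches on the index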